[ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test#37616
noooop merged 4 commits into vllm-project:main
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request aims to fix a flaky test for Cohere/OpenAI embedding parity on ROCm by adding ROCM_EXTRA_ARGS to the test server's configuration. This introduces arguments to disable prefix caching and limit the maximum number of sequences to one on ROCm platforms. While this change successfully stabilizes the test, I have a concern that limiting sequences to one effectively disables batch processing, which undermines the purpose of the test_batch_parity test. My review includes a suggestion to handle this more explicitly to maintain test integrity.
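A minimal sketch of the platform-conditional argument pattern the review describes. The helper `build_server_args` and its `is_rocm` parameter are hypothetical, and the flag spellings are assumed to match vLLM's CLI options for disabling prefix caching and capping concurrent sequences; the PR's actual `ROCM_EXTRA_ARGS` value is not shown in this conversation.

```python
# Hypothetical helper: append ROCm-only flags to a test server's argument list.
# Flag names are assumptions, not copied from the PR.
ROCM_EXTRA_ARGS = ["--no-enable-prefix-caching", "--max-num-seqs", "1"]


def build_server_args(base_args: list[str], is_rocm: bool) -> list[str]:
    """Return a new list; base_args is left unmodified."""
    extra = ROCM_EXTRA_ARGS if is_rocm else []
    return [*base_args, *extra]


cuda_args = build_server_args(["--max-model-len", "512"], is_rocm=False)
rocm_args = build_server_args(["--max-model-len", "512"], is_rocm=True)
```

With one sequence at a time the server cannot batch requests together, which is the trade-off the review raises for `test_batch_parity`.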
Testing MI325 to see if the issue is resolved (added
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Test has been confirmed green: https://buildkite.com/vllm/amd-ci/builds/6732/steps/canvas?sid=019d0bd9-2a24-4529-a7c3-4c16a3f66397&tab=output
    await self._prepare_generators(ctx)
    await self._collect_batch(ctx)
    try:
        await self._collect_batch(ctx)
Why is this needed? We now use app-level error handlers to convert exceptions into error responses.
@DarkLight1337 The app-level Exception handler at api_server.py:270 handles dimensions=-1 correctly (returns 400), but for the immediately following dimensions=16 request on the same connection, the same ValueError from pooling_params.verify() escapes the Starlette ExceptionMiddleware and crashes the ASGI app. The client gets APIConnectionError instead of BadRequestError.
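A rough sketch of the handler-level guard being discussed: catch the validation error where the batch is collected and turn it into a 400-style response, rather than relying on an outer handler. Everything here is a hypothetical stand-in (`collect_batch`, `handle_request`, the dict-shaped response, and the rule that only `dimensions=8` is valid); it does not reproduce vLLM's serving code or Starlette's middleware.

```python
import asyncio
from http import HTTPStatus


async def collect_batch(dimensions: int) -> list[float]:
    # Hypothetical stand-in for late parameter validation during collection.
    if dimensions != 8:
        raise ValueError(f"unsupported dimensions: {dimensions}")
    return [0.0] * dimensions


async def handle_request(dimensions: int) -> dict:
    try:
        embedding = await collect_batch(dimensions)
    except ValueError as exc:
        # Convert the error locally so it never reaches outer layers.
        return {"status": HTTPStatus.BAD_REQUEST.value, "error": str(exc)}
    return {"status": HTTPStatus.OK.value, "embedding": embedding}


async def main() -> list[dict]:
    # Two invalid requests back to back, then a valid one.
    return [await handle_request(d) for d in (-1, 16, 8)]


results = asyncio.run(main())
statuses = [r["status"] for r in results]
```

In this toy version both invalid requests get the same 400 response regardless of ordering, which is the behavior the test expects from the real server.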
…agation Signed-off-by: Andreas Karatzas <akaratza@amd.com>
    for dimensions in [-1, 16]:
        with pytest.raises(openai.BadRequestError):
            await make_request_and_correctness_test(dimensions)
May I ask why this test can pass on the NVIDIA GPU?
Or did this test not pass on the NVIDIA GPU?
In my understanding, NVIDIA GPUs should also suffer from the issue below:
> but for the immediately following dimensions=16 request on the same connection, the same ValueError from pooling_params.verify() escapes the Starlette ExceptionMiddleware and crashes the ASGI app. The client gets APIConnectionError instead of BadRequestError.
The reason the observed issue surfaces more often on ROCm is likely timing. Slower engine processing widens the window where the async generator cleanup and the ServerErrorMiddleware re-raise interact with the keep-alive connection state. On NVIDIA the race window is narrower, so the test may pass consistently or fail only intermittently. That's what I think is going on.
That's eye-opening.
How were you able to spot and fix a race condition bug? LOL
Oh, I didn't spot it at an exact line inside our stack 😅 I mostly focus on the network sluggishness of our infra these days, so I am guessing this is what is going on.
I have a bad feeling that
- some exceptions won't be caught by the app-level error handlers?
- Or will the async main loop catch exceptions from another coroutine?
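As a small illustration of the second question, here is how plain `asyncio` behaves, assuming nothing about Starlette or vLLM: an exception raised inside a task does not surface where the task is created, only where it is awaited. The `worker` coroutine and the `events` list are hypothetical names used for the demonstration.

```python
import asyncio


async def worker() -> None:
    await asyncio.sleep(0)
    raise ValueError("bad dimensions")


async def main() -> list[str]:
    events: list[str] = []
    try:
        # Creating the task only schedules the coroutine; nothing is raised here.
        task = asyncio.create_task(worker())
        events.append("created without error")
    except ValueError:
        events.append("caught at creation")
    try:
        # The exception is re-raised in whoever awaits the task.
        await task
    except ValueError:
        events.append("caught at await")
    return events


events = asyncio.run(main())
```

A task that is never awaited keeps its exception to itself until it is garbage collected, so a handler wrapped around a different coroutine would not see it. Whether that is what happens in the server here is not established by this thread.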
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
…t#37616) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
Follow-up for:
Stabilizes the Cohere test that was failing due to batch invariance issues on ROCm. Addresses the failure in mi325_1: Entrypoints Integration (Pooling).
Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1a2e-4d29-91e7-9eb765bc4904&tab=output
Related:
cc @kenroche